Optimal outlier removal in high-dimensional spaces

Authors

  • John Dunagan
  • Santosh Vempala
Abstract

We study the problem of finding an outlier-free subset of a set of points (or a probability distribution) in n-dimensional Euclidean space. As in [BFKV 99], a point x is defined to be a β-outlier if there exists some direction w in which its squared distance from the mean along w is greater than β times the average squared distance from the mean along w. Our main theorem is that for any ε > 0, there exists a (1−ε) fraction of the original distribution that has no O((n/ε)(b + log(n/ε)))-outliers, improving on the previous bound of O(nb/ε). This is asymptotically the best possible, as shown by a matching lower bound. The theorem is constructive, and results in a 1/(1−ε) approximation to the following optimization problem: given a distribution μ (i.e. the ability to sample from it), and a parameter ε > 0, find the minimum β for which there exists a subset of probability at least (1−ε) with no β-outliers.
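
The β-outlier definition above can be checked numerically for a finite point set. The following is a minimal sketch and not the authors' algorithm. It assumes the points are given as rows of an array and that the second-moment matrix about the mean is nonsingular. Under that assumption, the worst-case direction w need not be searched for explicitly, because the maximum over w of (w·d)² / (wᵀMw) equals dᵀM⁻¹d. The function names `outlier_scores` and `beta_outliers` are hypothetical and chosen for illustration.

```python
import numpy as np

def outlier_scores(points):
    """For each point x, return the maximum over directions w of
    (w.(x - mean))^2 divided by the average of (w.(y - mean))^2 over all points y.
    Assumes the second-moment matrix about the mean is nonsingular.
    """
    X = np.asarray(points, dtype=float)
    centered = X - X.mean(axis=0)
    # Second-moment matrix about the mean: M = average of d d^T over all points.
    M = centered.T @ centered / len(X)
    # For nonsingular M, max over w of (w.d)^2 / (w^T M w) equals d^T M^{-1} d.
    solved = np.linalg.solve(M, centered.T).T
    return np.einsum("ij,ij->i", centered, solved)

def beta_outliers(points, beta):
    """Boolean mask marking the points that are beta-outliers of the set."""
    return outlier_scores(points) > beta

# Four points near the origin and one far away along the first axis.
pts = [[1, 0], [-1, 0], [0, 1], [0, -1], [10, 0]]
print(outlier_scores(pts))
print(beta_outliers(pts, 3.0))
```

Under the same nonsingularity assumption, the scores always average to the dimension n, so a set can have no β-outliers only when β is at least n. Removing a flagged point changes the mean and the second-moment matrix, so the scores have to be recomputed after every removal.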


Similar resources

RNN (Reverse Nearest Neighbour) in Unproven Reserve Based Outlier Discovery

Outlier detection refers to the task of identifying patterns that do not conform to established regular behavior. Outlier detection in high-dimensional data presents various challenges resulting from the “curse of dimensionality”. The current view is that distance concentration, that is, the tendency of distances in high-dimensional data to become indiscernible, makes distance-based methods label all points a...


Feature Extraction for Outlier Detection in High-Dimensional Spaces

This work addresses the problem of feature extraction for boosting the performance of outlier detectors in high-dimensional spaces. Recent years have observed the prominence of multidimensional data on which traditional detection techniques usually fail to work as expected due to the curse of dimensionality. This paper introduces an efficient feature extraction method which brings nontrivial im...


Robust high-dimensional semiparametric regression using optimized differencing method applied to the vitamin B2 production data

Background and purpose: By evolving science, knowledge, and technology, we deal with high-dimensional data in which the number of predictors may considerably exceed the sample size. The main problems with high-dimensional data are the estimation of the coefficients and interpretation. For high-dimension problems, classical methods are not reliable because of a large number of predictor variable...


A Robust Method for Detecting DB-Outliers from High Dimensional Datasets

Outlier detection is a popular technique that can be utilized in many modern applications like financial analysis and fraud detection. As data descriptions become more complex, the dimensionality of the datasets involved keeps increasing. However, current research finds that it is extremely difficult to pick out outliers directly from high-dimensional datasets owing to the curse of dimensionality. ...



Journal:
  • J. Comput. Syst. Sci.

Volume 68  Issue 

Pages  -

Publication date 2004